Discretized Streams: A Fault-Tolerant Model for Scalable Stream Processing

Authors

  • Matei Zaharia
  • Tathagata Das
  • Haoyuan Li
  • Timothy Hunter
  • Scott Shenker
  • Ion Stoica
Abstract

Many “big data” applications need to act on data arriving in real time. However, current programming models for distributed stream processing are relatively low-level, often leaving the user to worry about consistency of state across the system and fault recovery. Furthermore, the models that provide fault recovery do so in an expensive manner, requiring either hot replication or long recovery times. We propose a new programming model, discretized streams (D-Streams), that offers a high-level functional API, strong consistency, and efficient fault recovery. D-Streams support a new recovery mechanism, parallel recovery of lost state, that improves efficiency over the traditional replication and upstream backup schemes in streaming databases. Unlike previous systems, D-Streams also mitigate stragglers. We implement D-Streams as an extension to the Spark cluster computing engine that lets users seamlessly intermix streaming, batch, and interactive queries. Our system can process over 60 million records/second at sub-second latency on 100 nodes.
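
The core idea named in the abstract, treating a stream as a sequence of small batches and applying functional operators to each one, can be illustrated with a minimal sketch in plain Python. This is not the Spark API or the authors' implementation: the record shape, the one-second interval, and the names `discretize` and `running_counts` are all assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple

# Hypothetical record shape: (arrival time in seconds, word).
Record = Tuple[float, str]


def discretize(records: Iterable[Record], interval: float) -> Iterator[List[str]]:
    """Group time-ordered records into consecutive batches of `interval` seconds."""
    batch: List[str] = []
    batch_end = interval
    for timestamp, word in records:
        # Close every interval that ended before this record arrived,
        # emitting empty batches for intervals that received no data.
        while timestamp >= batch_end:
            yield batch
            batch = []
            batch_end += interval
        batch.append(word)
    yield batch


def running_counts(batches: Iterable[List[str]]) -> Iterator[Counter]:
    """Fold each batch into a running word count, one deterministic step per batch."""
    state: Counter = Counter()
    for batch in batches:
        # The new state depends only on the previous state and this batch,
        # so a lost state could be rebuilt by replaying the same inputs.
        state = state + Counter(batch)
        yield state


records = [(0.1, "a"), (0.4, "b"), (1.2, "a"), (3.5, "c")]
batches = list(discretize(records, interval=1.0))
counts = list(running_counts(batches))
```

Because each step is a pure function of the previous state and one batch, recomputation after a failure gives the same result. In this toy version that property is only hinted at; it does not model distribution, parallel recovery, or straggler mitigation.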

Similar resources

Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters

Many important “big data” applications need to process data arriving in real time. However, current programming models for distributed stream processing are relatively low-level, often leaving the user to worry about consistency of state across the system and fault recovery. Furthermore, the models that provide fault recovery do so in an expensive manner, requiring either hot replication or lon...

Large-Scale Online Expectation Maximization with Spark Streaming

Many “Big Data” applications in Machine Learning (ML) need to react quickly to large streams of incoming data. The standard paradigm nowadays is to run ML algorithms on frameworks designed for batch operations, such as MapReduce or Hadoop. By design, these frameworks are not a good match for low-latency applications. This is why we explore using a new, recently proposed model for large-scale st...

Replication Schemes to Support Failure Resilient Processing of Real Time Data Streams

In this paper we explore the use of replication for fault tolerant processing of streams. We perform these experiments in the context of the Granules stream processing system that is designed for real time processing of data streams generated by devices and instruments. In this paper we explore well-known replication schemes for fault tolerant processing of data streams. We analyze two basic ap...

WAVES: Big Data Platform for Real-time RDF Stream Processing

Processing data as they arrive has recently gained momentum to mine continuous, high-volume and unbounded sequence of data streams. Due to the heterogeneity and the multi-modality of this data, RDF is widely used to provide a unified metadata layer in streaming context. In response to this ever-increasing demand, a number of systems and languages were produced, aiming at RDF stream processing (...

Robust Security Mechanisms for Data Streams Systems

Stream database systems are designed to support the fast on-line processing that characterizes many new emerging applications such as pervasive computing, sensor-based environments, on-line business processing and network monitoring. The sensitive nature of the data and the high-demands environment where data can be lost or dropped because of limited buffer storage or real-time constraints, req...

Publication date: 2012